The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Authors

  • Shibu Yooseph
  • Granger Sutton
  • Douglas B Rusch
  • Aaron L Halpern
  • Shannon J Williamson
  • Karin Remington
  • Jonathan A Eisen
  • Karla B Heidelberg
  • Gerard Manning
  • Weizhong Li
  • Lukasz Jaroszewski
  • Piotr Cieplak
  • Christopher S Miller
  • Huiying Li
  • Susan T Mashiyama
  • Marcin P Joachimiak
  • Christopher van Belle
  • John-Marc Chandonia
  • David A Soergel
  • Yufeng Zhai
  • Kannan Natarajan
  • Shaun Lee
  • Benjamin J Raphael
  • Vineet Bafna
  • Robert Friedman
  • Steven E Brenner
  • Adam Godzik
  • David Eisenberg
  • Jack E Dixon
  • Susan S Taylor
  • Robert L Strausberg
  • Marvin Frazier
  • J. Craig Venter
Abstract

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

Related resources

Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

BACKGROUND The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new famil...

The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples

Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high a...

The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific

The world's oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transe...

Evolutionary dynamics of clustered irregularly interspaced short palindromic repeat systems in the ocean metagenome.

Clustered regularly interspaced short palindromic repeats (CRISPRs) form a recently characterized type of prokaryotic antiphage defense system. The phage-host interactions involving CRISPRs have been studied in experiments with selected bacterial or archaeal species and, computationally, in completely sequenced genomes. However, these studies do not allow one to take prokaryotic population dive...

The capsid of the T4 phage superfamily: the evolution, diversity, and structure of some of the most prevalent proteins in the biosphere.

The Escherichia coli bacteriophage T4 has served as a classic system in phage biology for more than 60 years. Only recently have phylogenetic analyses and genomic comparisons demonstrated the existence of a large, diverse, and widespread superfamily of T4-like phages in the environment. We report here on the T4-like major capsid protein (MCP) sequences that were obtained by targeted polymerase ...

Journal title:
  • PLoS Biology

Volume 5  Issue 

Pages  -

Publication date 2007